Trying to load multiple json files and merge into one pandas dataframe
I am trying to load multiple JSON files from a directory in my Google Drive into one pandas dataframe.
I have tried quite a few solutions, but nothing has worked so far.
This is what I have tried:
import os
import json
import pandas as pd

path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])

for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index] = [participants, messages, active, threadtype, threadpath]

jsons_data
And this is the full traceback of error message I am receiving:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']
/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have added a sample of the JSON files I am trying to read from.
Link to Jsons
Examples of the JSONs:
{
  participants: [
    {
      name: "Test 1"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Person",
      timestamp_ms: 1485467319139,
      content: "Hie",
      type: "Generic"
    }
  ],
  title: "Test 1",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/xyz"
}

#second example
{
  participants: [
    {
      name: "Clearance"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Emmanuel Sibanda",
      timestamp_ms: 1212242073308,
      content: "Dear",
      share: {
        link: "http://www.example.com/"
      },
      type: "Share"
    }
  ],
  title: "Clearance",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/Clearance"
}
python json pandas
Instead of adding a link to the JSON files, can you edit your question and show example JSON from the files.
– amanb
Jan 14 at 7:44
@amanb added 2 examples of the JSON files
– Emm
Jan 14 at 7:52
These are malformed JSONs. The keys are not strings, and some values are also not strings, which may eventually cause errors. You should find a way to validate them using a JSON validator (a minimal sketch follows these comments).
– amanb
Jan 14 at 8:01
Can you share an example output of the dataframe you expect to see? What shape do you expect? Some sort of a tabular representation of one of the sample JSONS should be ideal!
– amanb
Jan 14 at 20:01
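A minimal sketch of such a validation pass, reporting each file that fails to parse and where (path_to_json is assumed from the question):

import json
import os

path_to_json = '/path/'  #assumed from the question; point this at your folder

#try to parse each .json file and report the position of the first error
for filename in sorted(os.listdir(path_to_json)):
    if not filename.endswith('.json'):
        continue
    with open(os.path.join(path_to_json, filename)) as f:
        try:
            json.load(f)
            print(filename, 'OK')
        except json.JSONDecodeError as e:
            print(filename, 'invalid JSON:', e)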
2 Answers
I've checked your JSON files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes.
For example, document1.json should be corrected as:
{
  "participants": [
    {
      "name": "Clothing"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1210107456233,
      "content": "Good day",
      "type": "Generic"
    }
  ],
  "title": "Clothing",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clothing"
}
EDIT: you can use the following line to add double quotes to the keys of a JSON file:
re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
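For illustration, a minimal sketch that applies this substitution to one of the sample files and parses the result (document1.json as above):

import json
import re

#read the malformed file as plain text
with open('document1.json') as f:
    s = f.read()

#quote every unquoted key, e.g.  participants: [  ->  "participants": [
fixed = re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
data = json.loads(fixed)  #parses cleanly once the keys are quoted
print(data['title'])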
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm, if this or any answer has solved your question, please consider accepting it.
– keineahnung2345
Jan 15 at 9:18
There were a few challenges in working with the JSON files you provided, and a few more in converting them to dataframes and merging them. First, the keys of the JSONs were not strings; second, the arrays in the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and third, you did not specify the shape you expect for the dataframe.
Nevertheless, this is an important problem, as malformed JSONs are more common than "valid" ones, and despite several SO answers on fixing such JSON strings, every malformed JSON problem is unique.
I've broken the problem down into the following parts:
- convert the malformed JSONs in the files to valid JSONs
- flatten the dicts in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge them into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files, "test.json" and "test1.json", and saved them in a "Test" folder.
Part 1: convert malformed JSONs in the files to valid JSONs:
The two example JSON strings you provided were not valid JSON at all, because the keys were not quoted strings. So even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
    data = json.load(f)
print(data)

#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files, as this treats the contents as plain strings
- to run a regex on the JSON strings in the text files and add quotes (" ") around the keys
- to save the files as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test', filename)
        with open(filepath, 'r+') as f:
            data = f.read()
            dict_converted = re.sub(r'(\w+):(.+)', r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename back to .json
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
  "participants": [
    {
      "name": "Test 1"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1485467319139,
      "content": "Hie",
      "type": "Generic"
    }
  ],
  "title": "Test 1",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/xyz"
}
If I read the JSON into a dataframe, I still get an error, because the array lengths differ between keys. You can check this: the "messages" value is an array of length 1, while "participants" is an array of length 2:
df = pd.read_json('./Test/test.json')
print(df)

#Error
ValueError: arrays must all be same length
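As an aside, newer pandas versions offer pandas.json_normalize, which can flatten such nested records directly; a minimal sketch, assuming the key layout from the examples (the participants list would still need separate handling):

import json
import pandas as pd

with open('./Test/test.json') as f:
    data = json.load(f)

#one row per message; thread-level fields are repeated as metadata columns
df = pd.json_normalize(data, record_path='messages',
                       meta=['title', 'is_still_participant', 'thread_type', 'thread_path'])
print(df)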
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape you expect for your dataframe, I extracted the values as best I could and flattened the dict with the following function. This assumes that the keys in the example JSONs do not change across all the JSON files:
#accepts a dictionary, flattens it as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
  "participants": [
    {
      "name": "Clearance"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Emmanuel Sibanda",
      "timestamp_ms": 1212242073308,
      "content": "Dear",
      "share": {
        "link": "http://www.example.com/"
      },
      "type": "Share"
    }
  ],
  "title": "Clearance",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can easily be fed into the DataFrame constructor (discussed later):
with open('./Test/test1.json') as f:
    data = json.load(f)
print(flatten(data))

#Output:
{'title': 'Clearance',
 'is_still_participant': 'True',
 'thread_type': 'Regular',
 'thread_path': 'inbox/Clearance',
 'participants_name': ['Clearance', 'Person'],
 'messages_sender_name': 'Emmanuel Sibanda',
 'messages_timestamp_ms': '1212242073308',
 'messages_content': 'Dear',
 'messages_type': 'Share',
 'messages_share_link': 'http://www.example.com/'}
Part 3: Create dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call it inside our final function, where we will:
- open the JSON files one by one and load each JSON as a dict in memory using json.load()
- call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list
- merge all dataframes with pd.concat(), passing the list of dataframes as an argument
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files,
#merges the dataframes and returns the merged dataframe
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3 to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have exception handling and assumes the keys will be the same across dicts, as shown in your examples.
- The shape and columns of the dataframes have also been assumed.
- You may add all the functions to one Python script; wherever "./Test" is used for the JSON folder path, enter your own path. The folder should contain only the malformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries (a minimal sketch follows below). However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
- Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
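On the threading note above, a minimal sketch of parallelizing the per-file fix with concurrent.futures; fix_one_file is a hypothetical stand-in for the per-file work in fix_dict_rename_to_json_files, and threads are appropriate here because the work is I/O-bound:

import os
import re
from concurrent.futures import ThreadPoolExecutor

def fix_one_file(filepath):
    #quote the keys of a single file in place, as in Part 1
    with open(filepath, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.write(re.sub(r'(\w+):(.+)', r'"\1":\2', data))
        f.truncate()
    return filepath

paths = [os.path.join('./Test', name)
         for name in os.listdir('./Test') if name.endswith('.json')]

with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(fix_one_file, paths):
        print("fixed:", done)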
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54177131%2ftrying-to-load-multiple-json-files-and-merge-into-one-pandas-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
edited Jan 14 at 8:59
answered Jan 14 at 7:44
keineahnung2345keineahnung2345
60012
60012
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
- open the JSON files one by one, load each JSON as a dict in memory using
json.load()
. - call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list.
- merge all dataframes with
pd.concat()
passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
df_list =
for index, js in enumerate(list_of_valid_json_files):
with open(os.path.join('./Test', js)) as json_file:
data = json.load(json_file)
flattened_json_data = flatten(data)
df = pd.DataFrame(flattened_json_data)
df_list.append(df)
merged_df = pd.concat(df_list,sort=False, ignore_index=True)
return merged_df
Let's give the whole code a test run. We begin with functions in Part1 and end with Part 3, to get a merged ddataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples
- The shape and columns of the Dataframes has also been assumed
- You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain mailformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with the use of hashable data types like tuples and sped up with
threading
andasyncio
libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long. - It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
- open the JSON files one by one, load each JSON as a dict in memory using
json.load()
. - call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list.
- merge all dataframes with
pd.concat()
passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
df_list =
for index, js in enumerate(list_of_valid_json_files):
with open(os.path.join('./Test', js)) as json_file:
data = json.load(json_file)
flattened_json_data = flatten(data)
df = pd.DataFrame(flattened_json_data)
df_list.append(df)
merged_df = pd.concat(df_list,sort=False, ignore_index=True)
return merged_df
Let's give the whole code a test run. We begin with functions in Part1 and end with Part 3, to get a merged ddataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples
- The shape and columns of the Dataframes has also been assumed
- You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain mailformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with the use of hashable data types like tuples and sped up with
threading
andasyncio
libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long. - It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
Now that we have a function that can flatten the dict, we can call it inside our final function, which will:
- open the JSON files one by one and load each JSON as a dict in memory using json.load()
- call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all the dataframes to an empty list
- merge all the dataframes with pd.concat(), passing the list of dataframes as an argument
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for js in list_of_valid_json_files:
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
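For instance, a reindex with an explicit column list; the names below assume the flatten() output shown earlier:
cols = ['participants_name', 'messages_sender_name', 'messages_timestamp_ms',
        'messages_content', 'messages_type', 'messages_share_link',
        'title', 'is_still_participant', 'thread_type', 'thread_path']
df = df.reindex(columns=cols)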
Note:
- The code does not have exception handling and assumes the keys will be the same across all dicts, as shown in your examples.
- The shape and columns of the dataframes have also been assumed.
- You may add all the functions into one Python script; wherever "./Test" is used as the JSON folder path, substitute your own path. The folder should only contain the malformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
- Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
edited Jan 22 at 11:57
answered Jan 18 at 15:29
amanb
Instead of adding a link to the JSON files, can you edit your question and show example JSON from the files.
– amanb
Jan 14 at 7:44
@amanb added 2 examples of the JSON files
– Emm
Jan 14 at 7:52
These are malformed JSONs. The keys are not strings, and some values are also not strings, which may eventually cause errors. You should find a way to validate them using a JSON validator.
– amanb
Jan 14 at 8:01
Can you share an example output of the dataframe you expect to see? What shape do you expect? Some sort of tabular representation of one of the sample JSONs would be ideal!
– amanb
Jan 14 at 20:01